Flower, dog, anxious, senior, car, Item, president, worried, avocado, Zendaya,
licorice, Nerdfighter, toothbrush, zany, expedient,
This isn’t really a Vlogbrothers video.
It’s just a random string of words.
There aren’t any coherent sentences.
It looks like John Green bot could use some help speaking a bit more like human John Green
- sounds like an excellent task for Natural Language Processing.
INTRO
Hey, I’m Jabril and welcome to Crash Course AI!
Today, we’re going to tackle another hands-on lab.
Our goal today is to get John-Green-bot to produce language that sounds like human John
Green… and have some fun while doing it.
We’ll be writing all of our code using a language called Python in a tool called Google
Colaboratory, and as you watch this video, you can follow along with the code in your
browser from the link we put in the description.
In these Colaboratory files, there’s some regular text explaining what we’re trying
to do, and pieces of code that you can run by pushing the play button.
Now, these pieces of code build on each other, so keep in mind that we have to run them in
order from top to bottom, otherwise we might get an error.
To actually run the code and experiment with changing it you’ll have to either click
“open in playground” at the top of the page or open the File menu and click “Save
a Copy to Drive”.
And just an FYI: you’ll need a Google account for this.
Now, we’re going to build an AI model that plays a clever game of fill-in-the-blank.
We’ll be able to give John-Green-bot any word prompt like “good morning,” and he’ll
be able to finish the sentence.
Like any AI, John-Green-bot won’t really understand anything, but AI generally does
a really good job of finding and copying patterns.
When we teach any AI system to understand and produce language, we’re really asking
it to find and copy patterns in some behavior.
So to build a natural language processing AI, we need to do four things:
First, gather and clean the data.
Second, set up the model.
Third, train the model.
And fourth, make predictions.
So let’s start with the first step: gather and clean the data.
In this case, the data are lots of examples of human John Green talking, and thankfully,
he’s talked a lot online.
We need some way to process his speech.
And how can we do that?
Subtitles.
And conveniently, there’s a whole database of subtitle files on the Nerdfighteria Wiki
that I pulled from.
I went ahead and collected a bunch and put them into one big file that’s hosted on
Crash Course AI’s GitHub.
This first bit of code in 1.1 loads it.
So if you wanted to try to make your AI sound like someone else, like Michael from Vsauce,
or me, this is where you’d load all that text instead.
Data gathering is often the hardest and slowest part of any machine learning project, but
in this instance it’s pretty straightforward.
Regardless, we aren’t done yet. Now we need to clean and prep our data for our
model.
This is called preprocessing.
Remember, a computer can only process data as numbers, so we need to split our sentences
into words, and then convert our words into numbers.
When we’re building a natural language processing program the term “word” may not capture
everything we need to know.
How many instances there are of a word can also be useful.
So instead, we’ll use the terms lexical type and lexical token.
Now a lexical type is a word, and a lexical token is a specific instance of a word, including
any repeats.
So, for example, in the sentence:
The goal of machine learning is to make a learning machine.
We have eleven lexical tokens but only nine lexical types, because “learning” and
“machine” both occur twice.
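To check that count, here is a minimal Python sketch. It isn't code from the Colaboratory notebook, and it assumes we lowercase the sentence, drop the final period, and split at spaces.

```python
# Count lexical tokens and lexical types in the example sentence.
sentence = "The goal of machine learning is to make a learning machine."

# Assumption: lowercase, drop the final period, and split at spaces.
tokens = sentence.lower().rstrip(".").split()

# A set keeps one copy of each word, so its size is the number of types.
types = set(tokens)

print(len(tokens))  # 11 lexical tokens
print(len(types))   # 9 lexical types
```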
In natural language processing, tokenization is the process of splitting a sentence into
a list of lexical tokens.
In English, we put spaces between words, so let’s start by slicing up the sentence at
the spaces.
“Good morning Hank, it’s Tuesday.”
would turn into a list like this.
And we would have five tokens.
However there are a few problems.
Something tells me we don’t really want a lexical type for Hank-comma and Tuesday-period,
so let’s add some extra rules for punctuation.
Thankfully, there are prewritten libraries for this.
Using one of those, the list would look something like this.
In this case we would have eight tokens instead of five, and tokenization even helped split
up our contraction “it’s” into “it” and “apostrophe-s.”
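Here is a rough sketch of both approaches in Python. The names `TOKEN_PATTERN` and `simple_tokenize` are made up for this illustration, and the regular expression is only a stand-in for the prewritten tokenization library the lab uses, which handles many more cases.

```python
import re

sentence = "Good morning Hank, it's Tuesday."

# Naive approach: slice the sentence at the spaces.
naive_tokens = sentence.split(" ")
print(naive_tokens)
# ['Good', 'morning', 'Hank,', "it's", 'Tuesday.']

# Hypothetical rule-based tokenizer: match the clitic "'s" first,
# then runs of word characters, then any single punctuation mark.
TOKEN_PATTERN = re.compile(r"'s|\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

better_tokens = simple_tokenize(sentence)
print(better_tokens)
# ['Good', 'morning', 'Hank', ',', 'it', "'s", 'Tuesday', '.']
```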
Looking back at our code, before tokenization, we had over 30,000 lexical types.
This code also splits our data into a training dataset and a validation dataset.
We want to make sure the model learns from the training data, but we can test it on new
data it’s never seen before.
That’s what the validation dataset is for.
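As a small illustration of that split, here is a sketch that holds out the last part of the data. The function name `split_dataset` and the `validation_fraction` value are assumptions for this example; the video doesn't say what proportion the notebook uses.

```python
def split_dataset(sentences, validation_fraction=0.1):
    """Hold out the last part of the data so the model never trains on it."""
    n_validation = max(1, int(len(sentences) * validation_fraction))
    return sentences[:-n_validation], sentences[-n_validation:]

# Ten placeholder sentences standing in for the subtitle data.
sentences = [f"sentence {i}" for i in range(10)]
train_data, validation_data = split_dataset(sentences, validation_fraction=0.2)

print(len(train_data), len(validation_data))  # 8 2
```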
We can count up our lexical types and lexical tokens with this bit of code in box 1.3.
And it looks like we actually have about 23,000 unique lexical types.
But remember, how many instances there are of a word can also be useful.
This code block here at step 1.4 lets us count how many lexical types occur
once, twice, and so on.
It looks like we’ve got a lot of rare words -- almost 10,000 words occur only once!
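One way to get those numbers is to count words, and then count the counts. This is a sketch on a made-up sentence, not the notebook's code.

```python
from collections import Counter

# A tiny made-up corpus standing in for the subtitle data.
tokens = "the cat saw the dog and the dog saw a zombicorn".split()

# How many times does each lexical type occur?
word_counts = Counter(tokens)

# How many lexical types occur once, twice, three times, and so on?
count_of_counts = Counter(word_counts.values())

print(len(tokens))          # 11 lexical tokens
print(len(word_counts))     # 7 lexical types
print(count_of_counts[1])   # 4 types occur only once
```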
Having rare words is really tricky for AI systems, because they’re trying to find
and copy patterns, so they need lots of examples of how to use each word.
Oh Human John Green.
You’re a master of prose.
Let’s see what weird words you use.
Pisgah?
What even is a lilliputian?
Some of these are pretty tricky and are going to be too hard for John-Green-bot’s AI to
learn with just this dataset.
But others seem doable if we take advantage of morphology.
Morphology is the way a word gets shape-shifted to match a tense, like you’d add an “ED”
to make something past tense, or when you shorten or combine words to make them totes-amazeballs.
Dear viewers, I did not write that in the script.
In English, we can remove a lot of extra word endings, like ED, ING, or LY, through a process
called stemming.
And so, with a few simple rules, we can clean up our data even more.
I’m also going to simplify the data by replacing numbers with a hashtag, or pound sign, whatever you want to call it.
This should take care of a lot of rare words.
Now we have 3,000 fewer lexical types, and only about 8,000 words occur once.
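To make those simple rules concrete, here is a very rough sketch. The suffix list, the minimum stem length, and the function name `simplify` are all assumptions for illustration. It splits the ending off as its own token and turns any all-digit token into a single "#". A real stemmer has many more rules, and this one will still make mistakes on words like "spring".

```python
import re

SUFFIXES = ("ing", "ed", "ly")  # hypothetical, far from a complete list

def simplify(token):
    """Return a list of simplified tokens for one input token."""
    # Any token made only of digits becomes a single "#".
    if re.fullmatch(r"\d+", token):
        return ["#"]
    # Split a known ending off the word, but only if at least
    # three letters of stem are left (so "sing" stays whole).
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return [token[: -len(suffix)], suffix]
    return [token]

tokens = ["we", "walked", "slowly", "in", "2019", "sing"]
simplified = [piece for token in tokens for piece in simplify(token)]
print(simplified)
# ['we', 'walk', 'ed', 'slow', 'ly', 'in', '#', 'sing']
```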
We really need multiple examples of each word for our AI to learn patterns reliably, so
we’ll simplify even more by replacing each of those 8,000 or so rare lexical tokens with
the word ‘unk’ or unknown.
Basically, we don’t want John-Green-bot to get embarrassed if he sees a word he doesn’t
know.
So by hiding some words, we can teach John-Green-bot how to keep writing when he bumps into
one-time made-up words like zombicorns.
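Here is a minimal sketch of that replacement step. The function name `replace_rare` and the `min_count` threshold are assumptions for this example, chosen so that any word seen only once becomes 'unk'.

```python
from collections import Counter

def replace_rare(tokens, min_count=2, unk="unk"):
    """Swap every token that occurs fewer than min_count times for unk."""
    counts = Counter(tokens)
    return [t if counts[t] >= min_count else unk for t in tokens]

tokens = "good morning hank good morning zombicorns".split()
print(replace_rare(tokens))
# ['good', 'morning', 'unk', 'good', 'morning', 'unk']
```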
And just to satisfy my curiosity…
Yeah, John-Green-bot doesn’t need words like “whippersnappers” or “zombification”.
John, what’s up with the fixation on zombies?
Anyway, we’ll be fine without them.
Now that we finally have our data all cleaned and put together, we’re done with preprocessing
and can move on to Step 2: setting up the model for John-Green-bot.
There are a couple key things that we need to do.
First, we need to convert the sentences into lists of numbers.
We want one number for every lexical type, so we’ll build a dictionary that assigns every
word in our vocabulary a number.
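A dictionary like that can be sketched in a few lines. The name `build_vocab` is made up for this example, and numbering words in the order they first appear is just one reasonable choice.

```python
def build_vocab(tokens):
    """Assign each lexical type the next unused number, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

tokens = ["good", "morning", "hank", ",", "it", "'s", "tuesday", ".", "good", "morning"]
vocab = build_vocab(tokens)

# Convert the sentence into a list of numbers.
ids = [vocab[t] for t in tokens]
print(ids)
# [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]
```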
Second, unlike us, the model can read a bunch of words at the same time, and we want to
take advantage of that to help John-Green-bot learn quickly.
So we’re going to split our data into pieces called batches.
Here, we’re telling the model to read 20 sequences (which have 35 words each) at the
same time!
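Here is one simple way to picture that batching, using the 20 and 35 from above. The function name `make_batches` is hypothetical, and the notebook may arrange its batches differently. This sketch just cuts the list of word numbers into 35-word sequences and groups them 20 at a time, dropping any leftovers.

```python
def make_batches(ids, batch_size=20, seq_len=35):
    """Cut a long list of word ids into batches of fixed-length sequences."""
    # Split into sequences of seq_len, dropping a leftover partial sequence.
    n_sequences = len(ids) // seq_len
    sequences = [ids[i * seq_len:(i + 1) * seq_len] for i in range(n_sequences)]
    # Group sequences into batches, dropping a leftover partial batch.
    n_batches = n_sequences // batch_size
    return [sequences[b * batch_size:(b + 1) * batch_size] for b in range(n_batches)]

# 1,500 placeholder word ids standing in for the real data.
ids = list(range(1500))
batches = make_batches(ids)

print(len(batches))         # 2 batches
print(len(batches[0]))      # 20 sequences per batch
print(len(batches[0][0]))   # 35 words per sequence
```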
Alright!
Now, it’s time to finally build our AI.
We’re going to program John-Green-bot with a simple language model that takes in a few
words and tries to complete the rest of the sentence.
So we’ll need two key parts, an embedding matrix and a recurrent neural network or RNN.
Just like we discussed in the Natural Language Processing video last week, this is an “Encoder-Decoder”
framework.
So let’s take it apart.
An embedding matrix is a big list of vectors, which is basically a big table of numbers,
where each row corresponds to a different word.
These vector-rows capture how related two words are.
So if two words are used in similar ways, then the numbers in their vectors should be
similar.
But to start, we don’t know anything about the words, so we just assign every word a
vector with random numbers.
Remember we replaced all the words with numbers in our training data, so now when the system
reads in a number, it just looks up that row in the table and uses the corresponding vector
as an input.
Part 1 is done: Words become indices, which become vectors, and our embedding matrix is
ready to use.
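To see that lookup in action, here is a tiny sketch in plain Python. The three-word vocabulary, the 4-number vectors, and the range of the random values are all assumptions for illustration; the real table is much bigger and gets adjusted during training.

```python
import random

def make_embedding_matrix(vocab_size, embedding_dim, seed=0):
    """One row of small random numbers per word in the vocabulary."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(embedding_dim)]
            for _ in range(vocab_size)]

vocab = {"good": 0, "morning": 1, "hank": 2}
embeddings = make_embedding_matrix(len(vocab), embedding_dim=4)

def embed(word):
    """Word -> index -> vector (a row of the table)."""
    return embeddings[vocab[word]]

vector = embed("morning")
print(len(embeddings), len(vector))  # 3 4
```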
Now, we need a model that can use those vectors intelligently.
This is where the RNN comes in.
We talked about the structure of a recurrent neural network in our last video too, but
it’s basically a model that slowly builds a hidden representation by incorporating one
new word at a time.
Depending on the task, the RNN will combine new knowledge in different ways.
With John-Green-bot, we’re training our RNN with sequences of words from Vlogbrothers
scripts.
Ultimately, our AI is trying to build a good summary to make sure a sentence has some overall
meaning, and it’s keeping track of the last word to produce a sentence that sounds like
English.
The RNN’s output after reading the final word so far in a sentence is what we’ll
use to predict the next word.
And this is what we’ll use to train John-Green-bot’s AI after we build it. All of this is wrapped up in code block 2.3.
So Part 2 is done. We've got our embedding matrix and our RNN.
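If you want to see the "one new word at a time" idea outside the notebook, here is a bare-bones sketch of a single recurrent step. It assumes the simplest kind of RNN, where the new hidden state is tanh of a weighted mix of the current word vector and the previous hidden state. The sizes, the weight names, and the random values are made up, and the model in code block 2.3 may be built differently.

```python
import math
import random

def rnn_step(x, h, W_xh, W_hh, b):
    """New hidden state: tanh(W_xh * x + W_hh * h + b), one unit at a time."""
    new_h = []
    for j in range(len(h)):
        total = b[j]
        total += sum(W_xh[j][i] * x[i] for i in range(len(x)))
        total += sum(W_hh[j][k] * h[k] for k in range(len(h)))
        new_h.append(math.tanh(total))
    return new_h

rng = random.Random(0)
embedding_dim, hidden_dim = 4, 3
W_xh = [[rng.uniform(-0.5, 0.5) for _ in range(embedding_dim)] for _ in range(hidden_dim)]
W_hh = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)] for _ in range(hidden_dim)]
b = [0.0] * hidden_dim

# Three made-up word vectors standing in for embedding lookups.
word_vectors = [[rng.uniform(-1, 1) for _ in range(embedding_dim)] for _ in range(3)]

# Start from an empty summary and fold in one word at a time.
hidden = [0.0] * hidden_dim
for x in word_vectors:
    hidden = rnn_step(x, hidden, W_xh, W_hh, b)

print(len(hidden))  # 3 numbers summarizing the words read so far
```

The final `hidden` list is the kind of summary that would be used to score candidates for the next word.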
Now, we’re ready for Step 3: train our model.
Remember when we split the data into pieces called batches?
And remember earlier in Crash Course AI when we used backpropagation to train neural networks?
Well we can put those pieces together, iterate over our dataset, and run backpropagation
on each example to train the model’s weights.
So in step 3.1 we’re defining how to train our model, in step 3.2 we’re defining
how to evaluate our model, and in step 3.3 we’re actually creating our model,
which means training and evaluating it.
Over the span of one epoch of training this model, the network will loop over every batch
of data -- reading it in, building representations, predicting the next word, and then updating
its guesses.
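The shape of that loop can be sketched with a much simpler stand-in model. This is not an RNN and not the notebook's training code: it's a toy that keeps one table of scores per previous word and nudges those scores after every example. The data, the learning rate, and the function names are all assumptions.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_epoch(pairs, logits, learning_rate=0.5):
    """One pass over (previous word id, next word id) pairs; returns the average loss."""
    total_loss = 0.0
    for prev_id, next_id in pairs:
        # Predict the next word from the previous one.
        probs = softmax(logits[prev_id])
        total_loss += -math.log(probs[next_id])
        # Update: the gradient of the cross-entropy loss for each score is (prob - target).
        for j in range(len(probs)):
            target = 1.0 if j == next_id else 0.0
            logits[prev_id][j] -= learning_rate * (probs[j] - target)
    return total_loss / len(pairs)

ids = [0, 1, 2, 0, 1, 2, 0, 1, 2]      # toy data: a repeating three-word pattern
pairs = list(zip(ids[:-1], ids[1:]))
vocab_size = 3
logits = [[0.0] * vocab_size for _ in range(vocab_size)]

# Train for 10 epochs and keep the average loss from each one.
losses = [train_epoch(pairs, logits) for _ in range(10)]
print(losses[0] > losses[-1])  # True: the loss goes down as the toy model learns
```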
This will train over 10 epochs, which might take a couple minutes.
We’re printing two numbers with each epoch, which are the model’s training and validation
perplexities.
As the model learns, it realizes there are fewer and fewer good choices for the next
word.
The perplexity is a measure of how well the model has narrowed down the choices.
Okay, it looks like the model is done training and has a perplexity of about 45 on train
and 72 on validation, but it started with perplexities in the hundreds!
We can interpret perplexity as the average number of guesses the model makes before it
predicts the right answer.
After seeing the data once, the model needed over 300 guesses for the next word, but now
it’s narrowed it down to fewer than 50.
That’s a pretty good improvement, even though it’s far from perfect.
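For the curious, perplexity is usually computed as the exponential of the average negative log-probability the model gave to the correct next words. Here is a sketch with made-up probabilities; the function name `perplexity` is just for this example.

```python
import math

def perplexity(probabilities):
    """exp of the average negative log-probability assigned to the correct words."""
    nll = [-math.log(p) for p in probabilities]
    return math.exp(sum(nll) / len(nll))

# If the model always spreads its bets evenly over 50 words,
# the perplexity comes out at about 50.
uniform = perplexity([1 / 50] * 4)
print(round(uniform, 6))  # 50.0

# A model that is always certain and always right has perplexity 1.
print(perplexity([1.0, 1.0]))  # 1.0
```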
Time to see what the model can write, but to do that, we need one final ingredient.
So far in Crash Course AI, we’ve talked a lot about the one best label or the one
best prediction an AI can make, but this doesn’t always make sense for certain problems.
If you wrote stories by always having characters do the next obvious thing, they’d be pretty
boring.
So Step 4 is inference, the part of AI where the machine gets to make some choices, but
we can still help it a little bit.
Let’s think about what the final layer of the RNN is actually doing.
We talk about it like it’s outputting a single label or prediction, but actually the
network is producing a bunch of scores or probabilities.
The most likely word has the highest probability, the next most likely word has the second highest
probability, and so on.
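A common way to turn raw scores into probabilities is the softmax function. That is an assumption about the final layer here, and the words and scores below are made up, but the sketch shows how one set of scores becomes a ranked list of options.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["morning", "news", "grief", "unk"]
scores = [2.0, 1.0, 0.5, -1.0]   # made-up scores for the word after "good"
probs = softmax(scores)

# Rank the candidate next words from most to least likely.
ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
for word, prob in ranked:
    print(word, round(prob, 3))
```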
Because we get probabilities at every step, instead of taking the best one each time to
produce 1 sentence, we could sample 3 words and start 3 new sentences.
Each of those 3 sentences could then start 3 more new sentences… and then we have a
branching diagram of possibilities.
Inference is so important because what the model can produce and what we want aren’t
necessarily the same thing.
What we want is a really good sentence, but the model can only tell us the score for one
word at a time.
Let’s look at this branching diagram.
Whenever we choose a word, we create a new branch, and keep track of its score or probability.
If we multiply each score through to the end of the branch, we see that the top branch
made the best-scoring choice but a worse sentence overall.
So we’re going to implement a basic sampler in our program.
This will take a bunch of random paths, so we can sort the results by the probability
of the full sentences, and we can see which sentences are best overall.
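Here is a toy version of that idea. The `NEXT_WORD` table is made up and stands in for the trained model, and `greedy_path` and `sample_path` are hypothetical names, not the sampler from the notebook. The numbers are chosen so that always taking the top word gives a sentence with probability 0.3, while "good news everyone" scores 0.4 x 0.9 = 0.36 overall.

```python
import random

# Made-up next-word probabilities, standing in for the trained model's output.
NEXT_WORD = {
    "good": {"morning": 0.6, "news": 0.4},
    "morning": {"hank": 0.5, "everyone": 0.5},
    "news": {"everyone": 0.9, "hank": 0.1},
    "hank": {"<end>": 1.0},
    "everyone": {"<end>": 1.0},
}

def greedy_path(start):
    """Always take the highest-probability next word."""
    words, prob = [start], 1.0
    while words[-1] != "<end>":
        options = NEXT_WORD[words[-1]]
        word = max(options, key=options.get)
        prob *= options[word]
        words.append(word)
    return words[:-1], prob

def sample_path(start, rng):
    """Pick each next word at random, weighted by its probability."""
    words, prob = [start], 1.0
    while words[-1] != "<end>":
        options = NEXT_WORD[words[-1]]
        word = rng.choices(list(options), weights=list(options.values()))[0]
        prob *= options[word]
        words.append(word)
    return words[:-1], prob

rng = random.Random(0)
# Take 50 random paths and keep one copy of each distinct sentence.
samples = {tuple(w): p for w, p in (sample_path("good", rng) for _ in range(50))}
# Sort the sentences by the probability of the whole sentence.
ranked = sorted(samples.items(), key=lambda item: item[1], reverse=True)

greedy_words, greedy_prob = greedy_path("good")
print(greedy_words, greedy_prob)
for words, prob in ranked:
    print(" ".join(words), round(prob, 2))
```

With enough random paths, the sampler will typically find the higher-scoring sentence that the greedy choice walks right past.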
Also, when asking John-Green-bot to generate all these sentences, we need to give him a
word to start.
I’m going to try “Good” for now, but you can try other things by changing the code
in 4.1.
Remember the preprocessing we did on our data?
That's why these sentences look a little off, with hashtags for numbers, and the space
before word endings that we introduced when stemming.
And look at the sentence you get from taking the highest probability word each time.
“Good morning Hank, it’s Tuesday.
I’m going to be like, I’m going to be like, I’m going to be like, I’m going to…”
See, it isn’t as interesting as the ones where we
mixed it up a bit and took different branches.
To be honest though… none of these are great Vlogbrothers scripts.
That’s because of two important things:
First, there’s our data.
Remember, we didn’t have many examples of how to use each word.
In fact, we had to cut out a lot of “rare words” during training because they only
showed up once, so we couldn’t teach John-Green-bot to recognize any patterns related to them.
Lots of state-of-the-art models address this by downloading data from Wikipedia, large
collections of books, or even Reddit when they train their models.
We’ll include some links in the description if you want to play with some fancier models.
But the second, bigger issue is that AI models are missing the understanding we have as humans.
Even if John-Green-bot split up words perfectly and predicted sentences that sound like English,
it’s still John-Green-bot using tools like tokenization, an embedding matrix, and a simple
language model to predict the next word.
When human John Green writes, he uses his understanding of the world, like in Vlogbrothers
videos, he considers Hank’s perspective or whoever’s watching.
He’s not just trying to predict which next word has the highest probability.
Building models that interact with people, and the world, is why natural language processing
is so exciting, but it’s also why it’ll take a lot more work to get John-Green-bot
to generate language as well as human John Green does.
We’ve left a bunch of notes in the code for you to play with if you want to make your own
AI.
You can train for longer, change the sentence prompt, or, if you’re feeling adventurous,
replace the text data to speak in someone else’s voice.
If you end up using this to make something cool let us know in the comments.
Thanks for watching, see you next week.
PBS Digital Studios wants to hear from you.
We do a survey every year that asks what you're into, your favorite PBS shows, and things you
would like to see more from PBS Digital Studios. You even get to vote on potential new shows.
All of this helps us make more stuff that you want to see.
The survey takes about 10 minutes and you might win a sweet t-shirt. Link is in the description. Thanks.
Crash Course AI is produced in association with PBS Digital Studios!
If you want to help keep all Crash Course free for everybody, forever, you can join
our community on Patreon.
And if you want to learn more about NLP check out this video from Crash Course Computer Science.